Abstract: We propose a unified approach for bottom-up hierarchical image segmentation and object proposal generation for
recognition, called Multiscale Combinatorial Grouping (MCG). For this purpose, we first develop a fast normalized cuts algorithm.
We then propose a high-performance hierarchical segmenter that makes effective use of multiscale information. Finally, we propose a
grouping strategy that combines our multiscale regions into highly accurate object proposals by efficiently exploring their combinatorial
space. We also present Single-scale Combinatorial Grouping (SCG), a faster version of MCG that produces competitive proposals in
under five seconds per image. We conduct an extensive empirical validation on the BSDS500, SegVOC12, SBD,
and COCO datasets, showing that MCG produces state-of-the-art contours, hierarchical regions, and object proposals.

Index Terms: Image segmentation, object proposals, normalized cuts.

1 Introduction

Two paradigms have shaped the field of object recognition in the last decade. The first one, popularized by the Viola-Jones face detection algorithm [1], formulates object localization as window classification. The basic scanning-window architecture, relying on histograms of gradients and linear support vector machines, was introduced by Dalal and Triggs [2] in the context of pedestrian detection and is still at the core of seminal object detectors on the PASCAL challenge such as Deformable Part Models [3].

The second paradigm relies on perceptual grouping to provide a limited number of high-quality and category-independent object proposals, which can then be described with richer representations and used as input to more sophisticated learning methods. Examples in this family are [4], [5]. Recently, this approach has dominated the PASCAL segmentation challenge [6], [7], [8], [9], improved object detection [10] and fine-grained categorization [11], and proven competitive in large-scale classification [12].

Since the power of this second paradigm is critically dependent on the accuracy and the number of object proposals, an increasing body of research has delved into the problem of their generation [13], [14], [15], [12], [16], [17], [18], [19]. However, those approaches typically focus on learning generic properties of objects from a set of examples, while reasoning on a fixed set of regions and contours produced by external bottom-up segmenters such as [20], [21].

Fig. 1. Top: original image, instance-level ground truth from COCO and our multiscale hierarchical segmentation. Bottom: our best object proposals among 150.

In this paper, we propose a unified approach to multiscale hierarchical segmentation and object proposal generation called Multiscale Combinatorial Grouping (MCG). Fig. 1 shows an example of our results and Fig. 2 an overview of our pipeline. Our main contributions are:

• An efficient normalized cuts algorithm, which in practice provides a 20× speed-up to the eigenvector computation required for contour globalization [20], [22] (Sect. 3.1).

• A state-of-the-art hierarchical segmenter that leverages multiscale information (Sect. 3.3).

• A grouping algorithm that produces accurate object proposals by efficiently exploring the combinatorial space of our multiscale regions (Sect. 5).

Fig. 2. Multiscale Combinatorial Grouping. Starting from a multiresolution image pyramid, we perform hierarchical segmentation at each scale independently. We align these multiple hierarchies and combine them into a single multiscale segmentation hierarchy. Our grouping component then produces a ranked list of object proposals by efficiently exploring the combinatorial space of these regions.


We conduct a comprehensive and large-scale empirical validation. On the BSDS500 (Sect. 4) we report significant progress in contour detection and hierarchical segmentation. On the VOC2012, SBD, and COCO segmentation datasets (Sect. 6), our proposals obtain overall state-of-the-art accuracy both as segmented proposals and as bounding boxes. MCG is efficient, its good generalization power makes it parameter free in practice, and it provides a ranked set of proposals that are competitive in all regimes of number of proposals.

2 Related Work 

For space reasons, we focus our review on recent normalized cut algorithms and object proposals for recognition.

Fast normalized cuts: The efficient computation of normalized-cuts eigenvectors has been the subject of recent work, as it is often the computational bottleneck in grouping algorithms. Taylor [23] presented a technique for using a simple watershed oversegmentation to reduce the size of the eigenvector problem, sacrificing accuracy for speed. We take a similar approach of solving the eigenvector problem in a reduced space, though we use simple image-pyramid operations on the affinity matrix (instead of a separate segmentation algorithm) and we see no loss in performance despite a 20× speed improvement. Maire and Yu [24] presented a novel multigrid solver for producing eigenvectors at multiple scales, which speeds up fine-scale eigenvector computation by leveraging coarse-scale solutions. Our technique also uses the scale-space structure of an image, but instead of solving the problem at multiple scales, we simply reduce the scale of the problem, solve it at a reduced scale, and then upsample the solution while preserving the structure of the image. As such, our technique is faster and much simpler, requiring only a few lines of code wrapped around a standard sparse eigensolver.


Object Proposals: Class-independent methods that generate object hypotheses can be divided into those whose output is an image window and those that generate segmented proposals.

Among the former, Alexe et al. [16] propose an objectness measure to score randomly-sampled image windows based on low-level features computed on the superpixels of [21]. Manen et al. [25] propose to use the Randomized Prim's algorithm, Zitnick et al. [26] group contours directly to produce object windows, and Cheng et al. [27] generate box proposals at 300 images per second. In contrast to these approaches, we focus on the finer-grained task of pixel-accurate object extraction, rather than on window selection. However, by just taking the bounding box around our segmented proposals, our results are also state of the art as window proposals.

Among the methods that produce segmented proposals, Carreira and Sminchisescu [18] hypothesize a set of placements of fore- and background seeds and, for each configuration, solve a constrained parametric min-cut (CPMC) problem to generate a pool of object hypotheses. Endres and Hoiem [19] base their category-independent object proposals on an iterative generation of a hierarchy of regions, based on the contour detector of [20] and occlusion boundaries of [28]. Kim and Grauman [17] propose to match parts of the shape of exemplar objects, regardless of their class, to contours detected by [20]. They infer the presence and shape of a proposal object by adapting the matched object to the computed superpixels.

Uijlings et al. [12] present a selective search algorithm based on segmentation. Starting with the superpixels of [21] for a variety of color spaces, they produce a set of segmentation hierarchies by region merging, which are used to produce a set of object proposals. While we also take advantage of different hierarchies to gain diversity, we leverage multiscale information rather than different color spaces.

Recently, two works proposed to train a cascade of classifiers to learn which sets of regions should be merged to form objects. Ren and Shakhnarovich [29] produce full region hierarchies by iteratively merging pairs of regions and adapting the classifiers to different scales. Weiss and Taskar [30] also specialize the classifiers to the size and class of the annotated instances to produce object proposals.

Malisiewicz and Efros [4] took one of the first steps towards combinatorial grouping, by running multiple segmenters with different parameters and merging up to three adjacent regions. In [8], another step was taken by considering hierarchical segmentations at three different scales and combining pairs and triplets of adjacent regions from the two coarser scales to produce object proposals.

The most recent wave of object proposal algorithms is represented by [13], [14], and [15], which all keep the quality of the seminal proposal works while improving the speed considerably. Krähenbühl and Koltun [13] find object proposals by identifying critical level sets in geodesic distance transforms, based on seeds placed in learnt places in the image. Rantalankila et al. [14] perform a global and local search in the space of sets of superpixels. Humayun et al. [15] reuse a graph to perform many parametric min-cuts over different seeds in order to speed the process up.

A substantial difference between our approach and previous work is that, instead of relying on precomputed hierarchies or superpixels, we propose a unified approach that produces and groups high-quality multiscale regions. With respect to the combinatorial approaches of [4], [8], our main contribution is to develop efficient algorithms to explore a much larger combinatorial space by taking into account a set of object examples, thus increasing the likelihood of having complete objects in the pool of proposals. Our approach therefore has the flexibility to adapt to specific applications and types of objects, and can produce proposals at any trade-off between their number and their accuracy.

3 The Segmentation Algorithm 

Consider a segmentation of the image into regions that partition its domain, $\mathcal{S} = \{S_i\}_i$. A segmentation hierarchy is a family of partitions $\{\mathcal{S}^*, \mathcal{S}^1, \ldots, \mathcal{S}^L\}$ such that: (1) $\mathcal{S}^*$ is the finest set of superpixels, (2) $\mathcal{S}^L$ is the complete domain, and (3) regions from coarse levels are unions of regions from fine levels. A hierarchy where each level $\mathcal{S}^i$ is assigned a real-valued index $\lambda_i$ can be represented by a dendrogram, a region tree where the height of each node is its index. Furthermore, it can also be represented as an ultrametric contour map (UCM), an image obtained by weighting the boundary of each pair of adjacent regions in the hierarchy by the index at which they are merged [31], [32]. This representation unifies the problems of contour detection and hierarchical image segmentation: a threshold at level $\lambda_i$ in the UCM produces the segmentation $\mathcal{S}^i$.

Fig. 3. Duality between a UCM and a region tree: Schematic view of the dual representation of a segmentation hierarchy as a region dendrogram and as an ultrametric contour map.

Figure 3 schematizes these concepts. First, the lower left corner shows the probability of boundary of a UCM. One of the main properties of a UCM is that when we threshold the contour strength at a certain value, we obtain a closed boundary map, and thus a partition. Thresholding at different $\lambda_i$, therefore, we obtain the so-called merging-sequence partitions (left column in Figure 3), named after the fact that a step in this sequence corresponds to merging the set of regions sharing the boundary of strength exactly $\lambda_i$.

For instance, the boundary between the wheels and the floor has strength $\lambda_1$, thus thresholding the contour above $\lambda_1$ makes the wheels merge with the floor. If we represent the regions in a partition as nodes of a graph, we can then represent the result of merging them as their parent in a tree. The result of sweeping all $\lambda_i$ values can therefore be represented as a region tree, whose root is the region representing the whole image (right part of Figure 3). Given that each merging is associated with a contour strength, the region tree is in fact a region dendrogram.
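The relation between thresholding a UCM and obtaining a partition can be sketched in a few lines. The following is only an illustrative toy, not the authors' implementation: it assumes the hierarchy is given as a dictionary of boundary strengths between leaf regions (the names `partition_at` and `boundaries`, and the convention that boundaries of strength less than or equal to the threshold disappear, are assumptions made for this example).

```python
def partition_at(num_leaves, boundaries, threshold):
    """Merge leaf regions whose shared boundary strength is <= threshold."""
    parent = list(range(num_leaves))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), strength in boundaries.items():
        if strength <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for leaf in range(num_leaves):
        groups.setdefault(find(leaf), []).append(leaf)
    return sorted(groups.values())


# Hypothetical UCM over four superpixels (strengths are ultrametric):
# regions 0 and 1 merge at 0.2, regions 2 and 3 at 0.3, everything at 0.5.
boundaries = {(0, 1): 0.2, (2, 3): 0.3, (1, 2): 0.5, (0, 2): 0.5}

# Sweeping the threshold yields the merging-sequence partitions.
sequence = [partition_at(4, boundaries, t) for t in (0.1, 0.2, 0.3, 0.5)]
```

Each step of `sequence` merges the regions separated by the weakest remaining boundary, which is exactly the information a region dendrogram stores as parent nodes and heights.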

As an example, in the gPb-ucm algorithm of [20], 
brightness, color and texture gradients at three fixed disk 
sizes are first computed. These local contour cues are 
globalized using spectral graph-partitioning, resulting in 
the gPb contour detector. Hierarchical segmentation is 
then performed by iteratively merging adjacent regions 
based on the average gPb strength on their common 
boundary. This algorithm produces therefore a tree of 
regions at multiple levels of homogeneity in brightness, 
color and texture, and the boundary strength of its UCM 
can be interpreted as a measure of contrast. 

Coarse-to-fine is a powerful processing strategy in computer vision. We exploit it in two different ways to develop an efficient, scalable and high-performance segmentation algorithm: (1) to speed up spectral graph partitioning and (2) to create aligned segmentation hierarchies.

3.1 Fast Downsampled Eigenvector Computation

The normalized cuts criterion is a key globalization mechanism of recent high-performance contour detectors such as [20], [22]. Although powerful, such spectral graph partitioning has a significant computational cost and memory footprint that limit its scalability. In this section, we present an efficient normalized cuts approximation which in practice preserves full performance for contour detection, while having low memory requirements and providing a 20× speed-up.

Given a symmetric affinity matrix A, we would like 
to compute the k smallest eigenvectors of the Laplacian 
of A. Directly computing such eigenvectors can be very 
costly even with sophisticated solvers, due to the large 
size of A. We therefore present a technique for efficiently 
approximating the eigenvector computation by taking 
advantage of the multiscale nature of our problem: A 
models affinities between pixels in an image, and images 
naturally lend themselves to multiscale or pyramid-like 
representations and algorithms. 

Our algorithm is inspired by two observations: 1) if $A$ is bistochastic (the rows and columns of $A$ sum to 1), then the eigenvectors of the Laplacian of $A$ are equal to the eigenvectors of the Laplacian of $A^2$, and 2) because of the scale-similar nature of images, the eigenvectors of a "downsampled" version of $A$ in which every other pixel has been removed should be similar to the eigenvectors of $A$. Let us define pixel_decimate(A), which takes an affinity matrix $A$ and returns the set of indices of rows/columns in $A$ corresponding to a decimated version of the image from which $A$ was constructed. That is, if $i$ = pixel_decimate(A), then $A[i, i]$ is a decimated matrix in which alternating rows and columns of the image have been removed. Computing the eigenvectors of $A[i, i]$ works poorly, as decimation disconnects pixels in the affinity matrix, but the eigenvectors of the decimated squared affinity matrix $A^2[i, i]$ are similar to those of $A$, because by squaring the matrix before decimation we intuitively allow each pixel to propagate information to all of its neighbors in the graph, maintaining connections even after decimation. Our algorithm works by efficiently computing $A^2[i, i]$ as $A[:, i]^\top A[:, i]$¹ (the naive approach of first squaring $A$ and then decimating it is prohibitively expensive), computing the eigenvectors of $A^2[i, i]$, and then "upsampling" those eigenvectors back to the space of the original image by pre-multiplying by $A[:, i]$. This squaring-and-decimation procedure can be applied recursively several times, each application improving efficiency while slightly sacrificing accuracy.
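The two claims in this paragraph (decimation alone disconnects pixels, and the decimated squared matrix can be obtained without squaring the full matrix) can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a dense 6-pixel chain affinity and plain Python lists, and the helper names `matmul`, `transpose` and `submatrix` are introduced only for this example. A real implementation would use sparse matrices.

```python
def matmul(X, Y):
    """Dense matrix product for lists of lists."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def submatrix(X, rows, cols):
    return [[X[r][c] for c in cols] for r in rows]


# Toy symmetric affinity of a 6-pixel chain: each pixel is tied to itself
# and to its immediate neighbors.
n = 6
A = [[1.0 if abs(r - c) <= 1 else 0.0 for c in range(n)] for r in range(n)]

keep = [0, 2, 4]                 # indices kept by decimation (every other pixel)
all_rows = list(range(n))

A_dec = submatrix(A, keep, keep)              # A[i, i]: decimation only
B = submatrix(A, all_rows, keep)              # A[:, i]
A2_dec = matmul(transpose(B), B)              # A[:, i]^T A[:, i]
A2_ref = submatrix(matmul(A, A), keep, keep)  # naive: square, then decimate
```

In this toy case `A_dec` is the identity (pixels 0 and 2 are no longer connected), whereas `A2_dec` keeps a non-zero affinity between them and matches the naive result `A2_ref`, because $A$ is symmetric.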

Pseudocode for our algorithm, which we call "DNCuts" (Downsampled Normalized Cuts), is given in Algorithm 1, where $A$ is our affinity matrix and $d$ is the number of times that our squaring-and-decimation operation is applied. Our algorithm repeatedly applies our joint squaring-and-decimation procedure, computes the smallest $k$ eigenvectors of the final "downsampled" matrix $A_d$ by using a standard sparse eigensolver ncuts($A_d$, $k$), and repeatedly "upsamples" those eigenvectors. Because our $A$ is not bistochastic and decimation is not an orthonormal operation, we must do some normalization throughout the algorithm (line 5) and whiten the resulting eigenvectors (line 12). We found that values of $d = 2$ or $d = 3$ worked well in practice. Larger values of $d$ yielded little speed improvement (as much of the cost is spent downsampling $A_0$) and started to negatively affect accuracy. Our technique is similar to Nyström's method for computing the eigenvectors of a subset of $A$, but our squaring-and-decimation procedure means that we do not depend on long-range connections between pixels.

1. The Matlab-like notation $A[:, i]$ indicates keeping the columns of matrix $A$ whose indices are in the set $i$.

Algorithm 1 dncuts($A$, $d$, $k$)

 1: $A_0 \leftarrow A$
 2: for $s = [1, 2, \ldots, d]$ do
 3:     $i_s \leftarrow$ pixel_decimate($A_{s-1}$)
 4:     $B_s \leftarrow A_{s-1}[:, i_s]$
 5:     $C_s \leftarrow \mathrm{diag}(B_s \mathbf{1})^{-1} B_s$
 6:     $A_s \leftarrow C_s^\top B_s$
 7: end for
 8: $X_d \leftarrow$ ncuts($A_d$, $k$)
 9: for $s = [d, d-1, \ldots, 1]$ do
10:     $X_{s-1} \leftarrow C_s X_s$
11: end for
12: return whiten($X_0$)

Fig. 4. Example of segmentation projection. In order to "snap" the boundaries of a segmentation $\mathcal{R}$ (left) to those of a segmentation $\mathcal{S}$ (middle), since they do not align, we compute $\pi(\mathcal{R}, \mathcal{S})$ (right) by assigning to each segment in $\mathcal{S}$ its mode among the labels of $\mathcal{R}$.

3.2 Aligning Segmentation Hierarchies 

In order to leverage multi-scale information, our approach combines segmentation hierarchies computed independently at multiple image resolutions. However, since subsampling an image removes details and smooths away boundaries, the resulting UCMs are misaligned, as illustrated in the second panel of Fig. 2. In this section, we propose an algorithm to align an arbitrary segmentation hierarchy to a target segmentation and, in Sect. 5, we show its effectiveness for multi-scale segmentation.

The basic operation is to "snap" the boundaries of a segmentation $\mathcal{R} = \{R_i\}_i$ to a segmentation $\mathcal{S} = \{S_j\}_j$, as illustrated in Fig. 4. For this purpose, we define $L(S_j)$, the new label of a region $S_j \in \mathcal{S}$, as the majority label of its pixels in $\mathcal{R}$:

$$L(S_j) = \arg\max_i \frac{|S_j \cap R_i|}{|S_j|} \qquad (1)$$

We call the segmentation defined by this new labeling of all the regions of $\mathcal{S}$ the projection of $\mathcal{R}$ onto $\mathcal{S}$ and denote it by $\pi(\mathcal{R}, \mathcal{S})$.
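The projection defined above amounts to a majority vote per target region. A minimal sketch follows, assuming both segmentations are given as label maps of identical size stored as nested lists; the function name `project` and the toy label maps are hypothetical, and ties are broken arbitrarily (by first occurrence) in this illustration.

```python
from collections import Counter


def project(R, S):
    """Relabel each region of S with the majority label of its pixels in R."""
    votes = {}
    for row_r, row_s in zip(R, S):
        for r_label, s_label in zip(row_r, row_s):
            votes.setdefault(s_label, Counter())[r_label] += 1
    new_label = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return [[new_label[s] for s in row] for row in S]


# R has a straight vertical boundary; S has three regions whose boundaries
# do not coincide with it.
R = [[0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 1]]
S = [[0, 0, 1, 1, 2, 2],
     [0, 0, 0, 1, 2, 2]]

projected = project(R, S)
```

The result carries the two labels of `R`, but its only boundary now runs along the boundary between regions 0 and 1 of `S`.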

In order to project a UCM onto a target segmentation $\mathcal{S}$, which we denote $\pi(\mathrm{UCM}, \mathcal{S})$, we project in turn each of the levels of the hierarchy onto $\mathcal{S}$. Note that, since all the levels are projected onto the same segmentation, the family of projections is by construction a hierarchy of segmentations. This procedure is summarized in pseudo-code in Algorithm 2.

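Algorithm 2 UCM Rescaling and Alignment

Require: A UCM with a set of levels $[t_1, \ldots, t_K]$
Require: A target segmentation $\mathcal{S}^*$

1: $\mathrm{UCM}_\pi \leftarrow 0$
2: for $t = [t_1, \ldots, t_K]$ do
3:     $\mathcal{S} \leftarrow$ sampleHierarchy(UCM, $t$)
4:     $\mathcal{S} \leftarrow$ rescaleSegmentation($\mathcal{S}$, $\mathcal{S}^*$)
5:     $\mathcal{S} \leftarrow \pi(\mathcal{S}, \mathcal{S}^*)$
6:     contours $\leftarrow$ extractBoundary($\mathcal{S}$)
7:     $\mathrm{UCM}_\pi \leftarrow \max(\mathrm{UCM}_\pi, t \cdot \mathrm{contours})$
8: end for
9: return $\mathrm{UCM}_\pi$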


Observe that the routines sampleHierarchy and 
extractBoundary can be computed efficiently because 
they involve only thresholding operations and connected 
components labeling. The complexity is thus dominated 
by rescaleSegmentation in Step 4, a nearest neighbor 
interpolation, and the projection in Step 5, which are 
computed K times. 
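The alignment loop can be illustrated with a simplified sketch. It makes several assumptions that are not in the text: the hierarchy levels are already sampled and given as (threshold, label map) pairs at the resolution of the target (so the rescaling step is skipped), and the aligned UCM is stored as a dictionary mapping pairs of adjacent target regions to a strength rather than as an image. All function names are hypothetical.

```python
from collections import Counter


def project(R, S):
    """Give every region of S the majority label of its pixels in R."""
    votes = {}
    for row_r, row_s in zip(R, S):
        for r_label, s_label in zip(row_r, row_s):
            votes.setdefault(s_label, Counter())[r_label] += 1
    winner = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return [[winner[s] for s in row] for row in S]


def boundary_pairs(seg, target):
    """Pairs of adjacent target regions that carry different labels in seg."""
    pairs = set()
    h, w = len(target), len(target[0])
    for y in range(h):
        for x in range(w):
            for yy, xx in ((y, x + 1), (y + 1, x)):
                if yy < h and xx < w and seg[y][x] != seg[yy][xx]:
                    pairs.add(tuple(sorted((target[y][x], target[yy][xx]))))
    return pairs


def align_ucm(levels, target):
    """Project each level onto target and keep the maximum surviving threshold."""
    ucm = {}
    for t, seg in levels:
        snapped = project(seg, target)
        for pair in boundary_pairs(snapped, target):
            ucm[pair] = max(ucm.get(pair, 0.0), t)
    return ucm


target = [[0, 0, 1, 1, 2, 2],
          [0, 0, 0, 1, 2, 2]]
fine = [[0, 0, 0, 1, 2, 2],      # level sampled at threshold 0.2
        [0, 0, 0, 1, 2, 2]]
coarse = [[0, 0, 0, 1, 1, 1],    # level sampled at threshold 0.6
          [0, 0, 0, 1, 1, 1]]

aligned = align_ucm([(0.2, fine), (0.6, coarse)], target)
```

Because every level is snapped to the same target regions, the resulting strengths live on a fixed set of boundary locations, which is what allows strengths from different scales to be compared later.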

3.3 Multiscale Hierarchical Segmentation 

Single-scale segmentation: We consider as input the following local contour cues: (1) brightness, color and texture differences in half-disks of three sizes [33], (2) sparse coding on patches [22], and (3) structured forest contours [34]. We globalize the contour cues independently using our fast eigenvector gradients of Sect. 3.1, combine global and local cues linearly, and construct a UCM based on the mean contour strength. We tried learning weights using gradient ascent on the F-measure on the training set [20], but evaluating the final hierarchies rather than open contours. We observed that this objective favors the quality of contours at the expense of regions and obtained better overall results by optimizing the Segmentation Covering metric [20].

Hierarchy Alignment: We construct a multiresolution pyramid with $N$ scales by subsampling / supersampling the original image and applying our single-scale segmenter. In order to preserve thin structures and details, we declare as set of possible boundary locations the finest superpixels in the highest resolution. Then, applying recursively Algorithm 2, we project each coarser UCM onto the next finer scale until aligning it to the highest-resolution superpixels.

Multiscale Hierarchy: After alignment, we have a fixed set of boundary locations, and $N$ strengths for each of them, coming from the different scales. We formulate this problem as binary boundary classification and train a classifier that combines these $N$ features into a single probability of boundary estimation. We experimented with several learning strategies for combining UCM strengths: (a) uniform weights transformed into probabilities with Platt's method, (b) SVMs and logistic regression, with both linear and additive kernels, (c) Random Forests, (d) the same algorithm as for single-scale. We found the results with all learning methods surprisingly similar, in agreement with the observation reported by [33]. This particular learning problem, with only a handful of dimensions and millions of data points, is relatively easy and performance is mainly driven by our already high-performing and well-calibrated features. We therefore use the simplest option (a).
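Option (a) can be sketched as an average of the per-scale strengths followed by a sigmoid of the Platt form $1 / (1 + \exp(a f + b))$. The sketch below is illustrative only: the function name `combine_scales` and the values of `a` and `b` are placeholders chosen for the example, whereas in practice these two parameters would be fitted on training data.

```python
import math


def combine_scales(strengths, a=-4.0, b=2.0):
    """Average the per-scale UCM strengths, then apply a Platt-style sigmoid.

    a and b are hypothetical values; Platt's method fits them on labeled data.
    """
    mean = sum(strengths) / len(strengths)
    return 1.0 / (1.0 + math.exp(a * mean + b))


# One boundary location observed at N = 3 scales.
p_weak = combine_scales([0.1, 0.0, 0.2])
p_mid = combine_scales([0.5, 0.5, 0.5])
p_strong = combine_scales([0.9, 0.8, 1.0])
```

With a negative `a`, stronger average boundaries map monotonically to higher boundary probabilities, so the ordering of the UCM strengths is preserved.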

4 Experiments on the BSDS500 

We conduct extensive experiments on the BSDS500 [35], using the standard evaluation metrics and following the best practice rules of that dataset. We also report results with a recent evaluation metric $F_{op}$ [36], Precision-Recall for objects and parts, using the publicly available code.

Single-scale Segmentation: Table 1 (top) shows the performance of our single-scale segmenter for different types of input contours on the validation set of the BSDS500. We obtain high-quality hierarchies for all the cues considered, showing the generality of our approach. Furthermore, when using them jointly (row 'Comb.' in top panel), our segmenter outperforms the versions with individual cues, suggesting its ability to leverage diversified inputs. In terms of efficiency, our fast normalized cuts algorithm provides an average 20× speed-up over [20], starting from the same local cues, with no significant loss in accuracy and with a low memory footprint.

Multiscale Segmentation: Table 1 (bottom) evaluates our full approach in the same experimental conditions as the upper panel. We observe a consistent improvement in performance in all the metrics for all the inputs, which validates our architecture for multiscale segmentation. We experimented with the range of scales and found $N = \{0.5, 1, 2\}$ adequate for our purposes. A finer sampling or a wider range of scales did not provide noticeable improvements. We also tested two degraded versions of our system (not shown in the table). For the first one, we resized contours to the original image resolution, created UCMs and combined them with the same method as our final system. For the second one, we transformed per-scale UCMs to the original resolution, but omitted the strength transfer to the finest superpixels before combining them. The first ablated version produces interpolation artifacts and smooths away details, while the second one suffers from misalignment. Both fail to improve performance over the single-scale result, which provides additional empirical support for our multiscale approach. We also observed a small degradation in performance when forcing the input contour detector to use only the original image resolution, which indicates the advantages of considering multiscale information at all stages of processing.

TABLE 1
BSDS500 val set. Control experiments for single-scale (top) and multiscale (bottom) hierarchical segmentation with different input contour detectors.

                         Boundary        Region
                         F_b             F_op            SC              PRI             VI
              Input      ODS    OIS      ODS    OIS      ODS    OIS      ODS    OIS      ODS    OIS
Single-scale  Pb [33]    0.702  0.733    0.334  0.370    0.577  0.636    0.801  0.847    1.692  1.490
              SC [22]    0.697  0.725    0.264  0.306    0.540  0.607    0.777  0.835    1.824  1.659
              SF [34]    0.719  0.737    0.338  0.399    0.582  0.651    0.803  0.851    1.608  1.432
              Comb.      0.719  0.750    0.358  0.403    0.602  0.655    0.809  0.855    1.619  1.405
Multiscale    Pb [33]    0.713  0.745    0.350  0.389    0.598  0.656    0.807  0.856    1.601  1.418
              SC [22]    0.705  0.734    0.331  0.384    0.579  0.647    0.799  0.851    1.637  1.460
              SF [34]    0.725  0.744    0.370  0.420    0.600  0.660    0.810  0.854    1.557  1.390
              Comb.      0.725  0.757    0.371  0.408    0.611  0.670    0.813  0.862    1.548  1.367

Since there are no drastic changes in our results when taking as input the different individual cues or their combination, in the sequel we use the version with structured forests for efficiency reasons, which we denote MCG-UCM-Our.

Comparison with the state of the art: Figure 5 compares our multiscale hierarchical segmenter MCG and our single-scale hierarchical segmenter SCG on the BSDS500 test set against all the methods for which there is publicly available code. We also compare to the recent ISCRA [29] hierarchies, provided precomputed by the authors. We consistently obtain the best results to date on BSDS500 for all operating regimes, both in terms of boundary and region quality.

Note that the range of object scales in the BSDS500 is limited, which translates into modest absolute gains from MCG with respect to SCG in terms of boundary evaluation (left-hand plot), but more significant improvements in terms of objects and parts (right-hand plot). We will also observe more substantial improvements with respect to gPb-UCM when we move to PASCAL, SBD, and COCO in Section 6 (e.g., see Fig. #).

Ground-Truth Hierarchy: In order to gain further insights, we transfer the strength of each ground-truth segmentation to our highest-resolution superpixels $\mathcal{S}^{N*}$ and construct a combined hierarchy. This approximation to the semantic hierarchy, Ground-Truth Hierarchy (GTH) in Fig. 5, is an upper bound for our approach, as both share the same boundary locations and the only difference is their strength. Since the strength of GTH is proportional to the number of subjects marking it, it naturally provides the correct semantic ordering, where outer object boundaries are stronger than internal parts.

Recently, Maire et al. [38] developed an annotation tool where the user encodes explicitly the "perceptual strength" of each contour. Our approach provides an alternative where the semantic hierarchy is reconstructed by sampling flat annotations from multiple subjects.

5 Object Proposal Generation 

The image segmentation algorithm presented in the 
previous sections builds on low-level features, so its 
regions are unlikely to represent accurately complete 
objects with heterogeneous parts. In this context, object 
proposal techniques create a set of hypotheses, possibly 
overlapping, which are more likely to represent full 
object instances. 

Our approach to object proposal generation is to combinatorially look for sets of regions from our segmentation hierarchies that, merged together, are likely to represent complete objects. In this section, we first describe the efficient computation of certain region descriptors on a segmentation tree. Then, we describe how we use these techniques to efficiently explore the sets of merged regions from the hierarchy. Finally, we explain how we train the parameters of our algorithm for object proposals and how we rank the candidates by their probability of representing an object.

Fast computation of descriptors: Let us assume, for 
instance, we want to compute the area of all regions 
in the hierarchy. Intuitively, working strictly on the 
merging-sequence partitions, we would need to scan all 
pixels in all partitions. On the other hand, working on 
the region tree allows us to scan the image only once to 
compute the area of the leaves, and then propagate the 
area to all parents as the addition of the areas of their 
children. 

As a drawback, the algorithms become intricate in terms of coding and necessary data structures. Take, for instance, the computation of the neighbors of a certain region, which is trivial via scanning the partition on the merging sequence (look for region labels in the adjacent boundary pixels), but needs tailored data structures and algorithms in the region tree.

Formally, let us assume the image has $p$ pixels, and we build a hierarchy based on $s$ superpixels (leaves of the tree), and $m$ mergings (different UCM strength values). The cost of computing the area on all regions using the merging-sequence partitions will be the cost of scanning all pixels in these partitions, thus $p \cdot (m+1)$. In contrast, the cost using the region tree will involve scanning the image once, and then propagating the area values, so $p + m$, which is notably faster.
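The single-scan-plus-propagation strategy can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' data structures: the hierarchy is given as a leaf label map plus a list of mergings in bottom-up order, and the names `leaf_stats`, `propagate` and `merges` are hypothetical. The same bottom-up pass also handles other additive or min/max descriptors, shown here with bounding boxes stored as (x0, y0, x1, y1).

```python
def leaf_stats(labels):
    """Scan the image once to get the area and bounding box of every leaf."""
    area, bbox = {}, {}
    for y, row in enumerate(labels):
        for x, leaf in enumerate(row):
            area[leaf] = area.get(leaf, 0) + 1
            x0, y0, x1, y1 = bbox.get(leaf, (x, y, x, y))
            bbox[leaf] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return area, bbox


def propagate(area, bbox, merges):
    """merges: (parent_id, [child_ids]) pairs, listed children before parents."""
    area, bbox = dict(area), dict(bbox)
    for parent, children in merges:
        area[parent] = sum(area[c] for c in children)
        boxes = [bbox[c] for c in children]
        bbox[parent] = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                        max(b[2] for b in boxes), max(b[3] for b in boxes))
    return area, bbox


# Four superpixels (leaves 0..3) and three mergings up to the root (node 6).
labels = [[0, 0, 1, 1],
          [0, 2, 2, 1],
          [3, 3, 2, 1]]
merges = [(4, [0, 3]), (5, [1, 2]), (6, [4, 5])]

area, bbox = propagate(*leaf_stats(labels), merges)
```

Only `leaf_stats` touches pixels; `propagate` does a constant amount of work per child, which is where the $p + m$ cost in the text comes from.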

We built tailored algorithms and data structures to compute the bounding box, perimeter, and neighbors of a region using the region tree representation.

Fig. 5. BSDS500 test set. Precision-Recall curves for boundaries [35] (left panel, "Boundaries") and for objects and parts [36] (right panel, "Objects and Parts"). The marker on each curve is placed on the Optimal Dataset Scale (ODS), and its F measure is presented in brackets in the legend. The isolated red asterisks refer to the human performance assessed on the same image (one human partition against the rest of human annotations on the same image) and on a different image (one human partition against the human partitions of a different, randomly selected, image).


Combinatorial Grouping of Proposals: We can cast object segmentation as selecting some regions in the hierarchy, or in other words, as a combinatorial optimization problem on the hierarchy. To illustrate this principle, Figure 6(a) shows the simplified representation of the hierarchy in Figure 3. Figures 6(b) and (c) show two object proposals, and their representation in the region hierarchy.



Fig. 6. Object segmentation as combinatorial optimization: Examples of objects (b), (c), formed by selecting regions from a hierarchy (a).


Since hierarchies are built taking only low-level features into account, and do not use semantic information, objects will usually not be optimally represented using a single region in the hierarchy. As an example, Figure 6(c) shows the optimum representation of the car, consisting of three regions.

A sensible approach to create object proposals is therefore to explore the set of $n$-tuples of regions. The main idea behind MCG is to explore this set efficiently, taking advantage of the region tree representation, via the fast computation of region neighbors.

The whole set of tuples, however, is huge, and so it is not feasible to explore it exhaustively. Our approach ranks the proposals using the height of their regions in the tree (UCM strength) and explores the tree from the top, but only down to a certain threshold. To speed the process up, we design a top-down algorithm to compute the region neighbors and thus only compute them down to a certain depth in the tree.

To further improve the results, we not only consider 
the n-tuples from the resulting MCG-UCM-Our hierar¬ 
chy, but also the rest of hierarchies computed at different 
scales. As we will show in the experiments, diversity 
significantly improves the proposal results. 
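
The exploration of singletons and pairs can be illustrated with a minimal sketch. It assumes a toy binary merge tree stored in plain dictionaries and checks neighborhood by brute force, whereas the text describes a faster top-down neighbor computation; the names `children`, `leaf_adjacency`, `strength`, and `ranked_proposals` are hypothetical and not taken from the MCG code. Pairs are scored by the minimum strength of their regions, which is the ranking criterion used in the experiments.

```python
from itertools import combinations

def leaves_of(node, children):
    """Return the set of leaf ids covered by `node` in the merge tree."""
    if node not in children:          # a leaf has no entry in `children`
        return {node}
    left, right = children[node]
    return leaves_of(left, children) | leaves_of(right, children)

def are_neighbors(a, b, children, leaf_adjacency):
    """Two regions are neighbors if they are disjoint and touch at leaf level."""
    leaves_a, leaves_b = leaves_of(a, children), leaves_of(b, children)
    if leaves_a & leaves_b:           # in a tree, overlap means containment
        return False
    return any(v in leaves_b for u in leaves_a for v in leaf_adjacency[u])

def ranked_proposals(strength, children, leaf_adjacency, min_strength):
    """Singletons and neighboring pairs above `min_strength`, strongest first."""
    kept = sorted(n for n, s in strength.items() if s >= min_strength)
    proposals = [((n,), strength[n]) for n in kept]
    for a, b in combinations(kept, 2):
        if are_neighbors(a, b, children, leaf_adjacency):
            # A pair is scored by the weakest of its regions (assumption
            # matching "minimum UCM strength of the regions").
            proposals.append(((a, b), min(strength[a], strength[b])))
    return sorted(proposals, key=lambda p: -p[1])

# Toy hierarchy: four leaves in a row (0-1-2-3), merged as
# 4 = (0, 1), 5 = (2, 3), 6 = (4, 5). All values are made up.
children = {4: (0, 1), 5: (2, 3), 6: (4, 5)}
leaf_adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
strength = {0: 0.1, 1: 0.1, 2: 0.2, 3: 0.2, 4: 0.5, 5: 0.6, 6: 0.9}
proposals = ranked_proposals(strength, children, leaf_adjacency, min_strength=0.5)
```

In this toy case the pair (4, 5) covers the same pixels as node 6, which is why a later duplicate-removal step is needed.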

Parameter Learning via Pareto Front Optimization:
MCG takes a set of diverse hierarchies and computes the
n-tuples up to a certain UCM strength. We can interpret
the n-tuples from each hierarchy as a ranked list of $N_i$
proposals that are put together to create the final set of
$N_p$ proposals.

At training time, we would like to find, for different
values of $N_p$, the number of proposals $N_i$ from each
ranked list such that the joint pool of $N_p$ proposals
has the best achievable quality. We frame this learning
problem as a Pareto front optimization [39], [40] with
two conflicting objective functions: number of proposals
and achievable quality. At test time, we select a working
point on the Pareto front, represented by the $\{N_i\}$
values, based either on the number of proposals $N_p$ we
can handle or on the minimum achievable quality our
application needs, and we combine the $N_i$ top proposals
from each hierarchy list.

Formally, assuming $R$ ranked lists $L_i$, an exhaustive
learning algorithm would consider all possible values
of the $R$-tuple $\{N_1, \dots, N_R\}$, where $N_i \in \{0, \dots, |L_i|\}$,
adding up to roughly $\prod_{i=1}^{R} |L_i|$ parameterizations to try, which is
intractable in our setting.

Figure 7 illustrates the learning process. To reduce the
dimensionality of the search space, we start by selecting
two ranked lists $L_1$, $L_2$ (green curves) and we sample
each list at $S$ levels of number of proposals (green dots).
We then scan the full $S^2$ different parameterizations to
combine the proposals from both (blue dots). In other
words, we analyze the sets of proposals created by
combining the top $N_1$ from $L_1$ (green dots) and the top
$N_2$ from $L_2$.

Fig. 7. Pareto front learning: Training the combinatorial
generation of proposals using the Pareto front. (Panels: ranked
lists of object proposals $L_1, \dots, L_R$; Pareto front reduction
of parameters.)

The key step of the optimization consists in discarding
those parameterizations whose quality point is not in
the Pareto front (red curve), i.e., those parameterizations
that can be substituted by another with better quality
with the same number of proposals, or by one with the
same quality with fewer proposals. We sample the Pareto
front to $S$ points and we iterate the process until all the
ranked lists are combined.

Each point in the final Pareto front corresponds to a
particular parameterization $\{N_1, \dots, N_R\}$. At train time,
we choose a point on this curve, either at a given
number of proposals $N_c$ or at the achievable quality
we are interested in (black triangle), and store the
parameters $\{N_1, \dots, N_R\}$. At test time, we combine the
$\{N_1, \dots, N_R\}$ top proposals from each ranked list. The
number of sampled configurations using the proposed
algorithm is $(R-1)S^2$, that is, we have reduced an
exponential problem ($S^R$) to a quadratic one.
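
The greedy procedure can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: each proposal is assumed to be a tuple holding its overlap with every annotated object, quality is taken as the mean best overlap, and the helper names (`pool_quality`, `levels`, `pareto_front`, `subsample`, `learn_pareto`) are hypothetical. The sketch requires S >= 2.

```python
def pool_quality(pool):
    """Mean over objects of the best overlap achieved by any proposal in the pool.
    Each proposal is a tuple holding its overlap with every annotated object."""
    if not pool:
        return 0.0
    n_objects = len(pool[0])
    return sum(max(p[j] for p in pool) for j in range(n_objects)) / n_objects

def levels(length, s):
    """S (or fewer) evenly spaced proposal counts between 0 and `length`."""
    return sorted({round(i * length / (s - 1)) for i in range(s)})

def pareto_front(points):
    """Keep the (count, quality, params) points that no other point dominates."""
    front, best_quality = [], float("-inf")
    for point in sorted(points, key=lambda p: (p[0], -p[1])):
        if point[1] > best_quality:   # strictly better than anything cheaper
            front.append(point)
            best_quality = point[1]
    return front

def subsample(front, s):
    """Evenly subsample the front to at most S points."""
    if len(front) <= s:
        return list(front)
    step = (len(front) - 1) / (s - 1)
    return [front[round(i * step)] for i in range(s)]

def learn_pareto(lists, s):
    """Fold the ranked lists together one by one, pruning to the Pareto front."""
    front = pareto_front(
        [(n, pool_quality(lists[0][:n]), (n,)) for n in levels(len(lists[0]), s)]
    )
    for k in range(1, len(lists)):
        candidates = []
        for _, _, params in subsample(front, s):      # at most S points kept
            for n_k in levels(len(lists[k]), s):      # at most S levels of list k
                new_params = params + (n_k,)
                pool = [p for lst, n in zip(lists, new_params) for p in lst[:n]]
                candidates.append((len(pool), pool_quality(pool), new_params))
        front = pareto_front(candidates)
    return front

# Two made-up ranked lists, two annotated objects.
lists = [[(0.9, 0.1), (0.2, 0.2)], [(0.1, 0.8), (0.3, 0.3)]]
front = learn_pareto(lists, s=3)
```

Each added list costs at most $S^2$ evaluations, which gives the $(R-1)S^2$ count stated above. As a purely illustrative figure, with $R = 16$ lists and a hypothetical $S = 20$, this is $15 \cdot 400 = 6\,000$ configurations instead of $20^{16} \approx 6.6 \times 10^{20}$.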

Regressed Ranking of Proposals: To further reduce
the number of proposals, we train a regressor from
low-level features, as in [?]. Since the proposals are
all formed by a set of regions from a reduced set of
hierarchies, we focus on features that can be computed
efficiently in a bottom-up fashion, as explained previously.

We compute the following features:

• Size and location: Area and perimeter of the candidate;
area, position, and aspect ratio of the bounding
box; and the area balance between the regions in the
candidate.

• Shape: Perimeter (and sum of contour strength)
divided by the square root of the area; and area of the
region divided by that of the bounding box.

• Contours: Sum of contour strength at the boundaries,
mean contour strength at the boundaries; minimum
and maximum UCM threshold of appearance and
disappearance of the regions forming the candidate.
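
As a rough illustration of the size, location, and shape features listed above, the following sketch computes them for a region given as a set of (row, column) pixels. The perimeter is approximated here by counting pixel edges that face outside the region, which is an assumption of this example; the feature names are also hypothetical. Contour-strength and UCM-threshold features are omitted because they need the hierarchy itself.

```python
import math

def region_features(pixels):
    """Size, location, and shape features of a region given as a set of (row, col)."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    top, left = min(rows), min(cols)
    height = max(rows) - top + 1
    width = max(cols) - left + 1
    area = len(pixels)
    # Perimeter approximated as the number of pixel edges facing outside the region.
    perimeter = sum(
        (r + dr, c + dc) not in pixels
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    return {
        "area": area,
        "perimeter": perimeter,
        "bbox_area": height * width,
        "bbox_position": (top, left),
        "bbox_aspect_ratio": width / height,
        "perimeter_over_sqrt_area": perimeter / math.sqrt(area),
        "area_over_bbox_area": area / (height * width),
    }

# A 2 x 3 rectangle of pixels as a toy region.
rectangle = {(r, c) for r in range(2) for c in range(3)}
features = region_features(rectangle)
```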

We train a Random Forest using these features to regress
the object overlap with the ground truth, and diversify
the ranking based on Maximum Marginal Relevance
measures [18]. We tune the Random Forest by learning on
half of the training set and validating on the other half. For
the final results, we train on the training set and evaluate
our proposals on the validation set of PASCAL 2012.
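
The diversification step can be sketched with the textbook greedy form of Maximum Marginal Relevance. This is an assumption: the exact measure used in the paper may differ, and `mmr_rank`, `trade_off`, and the toy masks and scores below are hypothetical. Each step picks the proposal with the best balance between its regressed score and its Jaccard similarity to the proposals already ranked.

```python
def jaccard(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mmr_rank(masks, scores, trade_off=0.5):
    """Greedy Maximum Marginal Relevance ordering of proposal indices."""
    remaining = list(range(len(masks)))
    ranked = []
    while remaining:
        def marginal_relevance(i):
            # Redundancy: highest similarity to anything already ranked.
            redundancy = max(
                (jaccard(masks[i], masks[j]) for j in ranked), default=0.0
            )
            return trade_off * scores[i] - (1 - trade_off) * redundancy
        best = max(remaining, key=marginal_relevance)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Proposal 1 nearly duplicates proposal 0; proposal 2 is a different object.
masks = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {10, 11}]
scores = [0.9, 0.85, 0.6]   # made-up regressed overlaps
order = mmr_rank(masks, scores)
```

Here the near-duplicate is pushed behind the lower-scored but distinct proposal, which is the intended effect of diversifying the ranking.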

6 Experiments on PASCAL VOC, SBD, 
and COCO 

This section presents our large-scale empirical validation
of the object proposal algorithm described in the previous
section. We perform experiments on three annotated
databases, with a variety of measures that demonstrate
the state-of-the-art performance of our algorithm.

Datasets and Evaluation Measures: We conduct experiments
on the following three annotated datasets: the
segmentation challenge of PASCAL 2012 Visual Object
Classes (SegVOC12) [41], the Berkeley Semantic Boundaries
Dataset (SBD) [42], and the Microsoft Common
Objects in Context (COCO) [43]. They all consist of
images with annotated objects of different categories.
Table 2 summarizes the number of images and object
instances in each database.



            Number of    Number of    Number of
            Classes      Images       Objects

SegVOC12    20           2 913        9 847
SBD         20           12 031       32 172
COCO        80           123 287      910 983


TABLE 2

Sizes of the databases


Regarding the performance metrics, we measure the
achievable quality with respect to the number of proposals,
that is, the quality we would have if an oracle
selected the best proposal among the pool. This aligns
with the fact that object proposals are a preprocessing
step for other algorithms that will represent and classify
them. We want, therefore, the achievable quality within
the proposals to be as high as possible, while reducing
the number of proposals to make the final system as fast
as possible.

As a measure of quality of a specific proposal with
respect to an annotated object, we consider the Jaccard
index $J$, also known as overlap or intersection over
union, which is defined as the size of the intersection
of the two pixel sets over the size of their union.

To compute the overall quality for the whole database,
we first select the best proposal for each annotated
instance with respect to $J$. The Jaccard index at instance
level ($J_i$) is then defined as the mean best overlap for all
the ground-truth instances in the database, also known
as Best Spatial Support score (BSS) [?] or Average Best
Overlap (ABO) [?].

Computing the mean of the best overlap on all objects,
as done by $J_i$, hides the distribution of quality among
different objects. As an example, $J_i = 0.5$ can mean
that the algorithm covers half the objects perfectly and
completely misses the other half, or can also mean that
all the objects are covered exactly at J = 0.5. This
information might be useful to decide which algorithm
to use. Computing a histogram of the best overlap would
provide very rich information, but then the resulting
plot would be 3D (number of proposals, bins, and bin
counts). Alternatively, we propose to plot different
percentiles of the histogram.

Interestingly, computing a certain percentile of the histogram
of best overlaps amounts to computing the number of
objects whose best overlap is above a certain Jaccard
threshold, which can be interpreted as the best achievable
recall of the technique over that threshold. We
compute the recall at three different thresholds: J = 0.5,
J = 0.7, and J = 0.85.
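
The measures above can be written down in a few lines. The sketch below works on pixel sets for a single image and reproduces the two situations used as an example in the text, both giving a mean best overlap of 0.5. The function names are hypothetical, and counting an overlap exactly equal to the threshold as a hit is an assumption of this example.

```python
def jaccard(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def best_overlaps(ground_truth, proposals):
    """Best Jaccard achieved by any proposal, for each annotated object."""
    return [
        max((jaccard(gt, p) for p in proposals), default=0.0)
        for gt in ground_truth
    ]

def jaccard_instance_level(best):
    """Mean best overlap over all annotated instances."""
    return sum(best) / len(best)

def recall_at(best, threshold):
    """Fraction of objects whose best overlap reaches the threshold
    (overlaps exactly at the threshold count as hits: an assumption)."""
    return sum(b >= threshold for b in best) / len(best)

ground_truth = [{1, 2}, {10, 11}]

# Case 1: one object covered perfectly, the other missed.
best_a = best_overlaps(ground_truth, [{1, 2}])
# Case 2: both objects covered at exactly J = 0.5.
best_b = best_overlaps(ground_truth, [{1}, {10}])

summary = {
    "case 1": (jaccard_instance_level(best_a), recall_at(best_a, 0.7)),
    "case 2": (jaccard_instance_level(best_b), recall_at(best_b, 0.7)),
}
```

Both cases have the same mean best overlap, but the recall at J = 0.7 separates them, which is the motivation for reporting recall at several thresholds.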

Learning Strategy Evaluation: We first estimate the
loss in performance incurred by the proposed greedy strategy,
which does not sweep all the possible values of
$\{N_1, \dots, N_R\}$ in the combination of proposal
lists. To do so, we compare this strategy with the full
combination on a reduced problem that makes the latter
feasible. Specifically, we combine the 4 ranked lists coming
from the singletons at all scales, instead of the full 16 lists
coming from singletons, pairs, triplets, and 4-tuples. We also
limit the search to 20 000 proposals, further speeding the
process up.

In this situation, the mean loss in achievable quality
along the full curve of parameterization is $J_i = 0.0002$,
with a maximum loss of $J_i = 0.004$ (0.74%). In exchange,
our proposed learning strategy on the full 16 ranked lists
takes about 20 seconds to compute on the training set of
SegVOC12, while the singleton-limited full combination
takes 4 days (the full combination would take months).

Combinatorial Grouping: We now evaluate the
Pareto front optimization strategy on the training set of
SegVOC12. As before, we extract the lists of proposals
from the three scales and the multiscale hierarchy, for
singletons, pairs, triplets, and 4-tuples of regions, leading
to 16 lists, ranked by the minimum UCM strength of the
regions forming each proposal.

Figure 8 shows the Pareto front evolution of $J_i$ with
respect to the number of proposals for up to 1, 2, 3, and
4 regions per proposal (4, 8, 12, and 16 lists, respectively)
at training and testing time on SegVOC12. As baselines,
we plot the raw singletons from MCG-UCM-Our, gPb-UCM,
and Quadtree, as well as the uniform combination
of scales.



Fig. 10. Region distribution learnt by the Pareto front
optimization on SegVOC12. (Horizontal axis: number of
proposals, from $10^3$ to $10^5$.)


The improvement of considering the combination of
all 1-region proposals from the 3 scales and the
MCG-UCM-Our with respect to the raw MCG-UCM-Our
is significant, which corroborates the gain in diversity
obtained from hierarchies at different scales. In turn,
the addition of 2- and 3-region proposals
noticeably improves the achievable quality. This shows
that hierarchies do not get full objects in single regions,
which makes sense given that they are built using
low-level features only. The improvement when adding
4-tuples is marginal at the number of proposals we
are considering. When analyzing the equal distribution
of proposals from the four scales, we see that the
fewer proposals we consider, the more relevant the Pareto
optimization becomes. At the selected working point, the
gain of the Pareto optimization is 2 points.

Figure 10 shows the distribution of proposals from
each of the scales combined in the Pareto front. We
see that the coarse scale (0.5) is the most picked at
low numbers of proposals, and the rest come into play
when increasing their number, since one can afford more
detailed proposals. The multi-scale hierarchy is the one
with the least weight, since it is created from the other three.

Pareto selection and ranking: Back to Figure 8,
the red asterisk marks the selected configuration
$\{N_1, \dots, N_R\}$ in the Pareto front (black triangle in
Figure 7), which is selected at a practical level of proposals.
The red plus sign represents the set of proposals after
removing those duplicate proposals whose overlap leads
to a Jaccard higher than 0.95. The proposals at this point
are the ones that are ranked by the learnt regressor.
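
One simple way to realize the duplicate removal is a greedy pass over the pool, sketched below. The greedy, order-dependent scheme and the function name `remove_near_duplicates` are assumptions of this example; only the 0.95 Jaccard threshold comes from the text.

```python
def jaccard(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def remove_near_duplicates(masks, max_jaccard=0.95):
    """Keep a proposal only if it does not overlap any kept one above the threshold."""
    kept = []
    for mask in masks:
        if all(jaccard(mask, other) <= max_jaccard for other in kept):
            kept.append(mask)
    return kept

full = set(range(100))
almost_full = set(range(99))    # Jaccard 0.99 with `full`: a near-duplicate
half = set(range(50))           # Jaccard 0.50 with `full`: kept
filtered = remove_near_duplicates([full, almost_full, half])
```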

At test time (right-hand plot), we directly combine the
learnt $\{N_1, \dots, N_R\}$ proposals from each ranked list.
Note that the Pareto optimization does not overfit, given
the similar result in the training and validation datasets.
We then remove duplicates and rank the results. In this
case, note the difference between the regressed result in
the training and validation sets, which reflects overfitting,
but despite this we found it beneficial with respect
to the non-regressed result.





Fig. 8. Pareto front evaluation. Achievable quality of our proposals for singletons, pairs, triplets, and 4-tuples; and
the raw proposals from the hierarchies on PASCAL SegVOC12 training (left) and validation (right) sets.
(Axis: Jaccard index at instance level ($J_i$). Legend: Pareto up to 4-tuples; Pareto up to triplets; Pareto up to pairs;
Pareto only singletons; Raw Ours-multi singletons; Raw gPb-UCM singletons; Raw Quadtree singletons; Equal
distribution; Selected configuration (asterisk); Filtered candidates (plus sign); Regressed ranking.)

Fig. 9. Object Proposals: Jaccard index at instance level. Results on SegVOC12, SBD, and COCO.


In the validation set of SegVOC12, the full set of
proposals (i.e., combining the full 16 lists) would contain
millions of proposals per image. The multiscale combinatorial
grouping allows us to reduce the number of
proposals to 5 086 with a very high achievable $J_i$ of 0.81.
The regressed ranking allows us to further
reduce the number of proposals below this point.

Segmented Proposals: Comparison with State of
the Art: We first compare our results against those
methods that produce segmented object proposals
(GOP [13], GLS, SeSe [12], RIGOR, CPMC, CI, and ShSh,
as listed in Table 4), using the implementations
from the respective authors. We train MCG on
the training set of SegVOC12, and we use the learnt
parameters on the validation sets of SegVOC12, SBD,
and COCO.

Figure 9 shows the achievable quality at instance level
($J_i$) of all methods on the validation sets of SegVOC12,
SBD, and COCO. We plot the raw regions of MCG-UCM-Our,
gPb-UCM, and QuadTree as baselines where
available. We also evaluate a faster single-scale version of
MCG (Single-scale Combinatorial Grouping, SCG), which
takes the hierarchy at the native scale only and combines
up to 4 regions per proposal. This approach decreases
the computational load by one order of magnitude while
keeping competitive results.

MCG proposals significantly outperform the
state of the art at all regimes. The bigger the database is,
the better MCG results are with respect to the rest, which
shows that our techniques generalize better to unseen
images (recall that MCG is trained only on SegVOC12).

As commented in the description of the measures, $J_i$ shows
mean aggregate results, so it can mask the distribution
of quality among objects in the database. Figure 11
shows the recall at three different Jaccard levels. First,
these plots further highlight how challenging COCO
is, since we observe a significant drop in performance,
more pronounced than when measured by $J_i$ and $J_c$.
Another interesting result comes from observing the
evolution of the plots for the three different Jaccard
values. Take for instance the performance of GOP
against MCG-Our in SBD. While for J = 0.5 GOP
slightly outperforms MCG, the higher the threshold, the
better MCG performs in comparison. Overall, MCG has
especially good results at higher J values. In other words,
if one looks for proposals of very high accuracy, MCG is
the method with the highest recall, at all regimes and in all
databases. In all measures and databases, SCG obtains very
competitive results, especially if we take into account
that it is 7× faster than MCG, as we will see in the next
sections.

Fig. 11. Segmented Object Proposals: Recall at different Jaccard levels. Percentage of annotated objects for
which there is a proposal whose overlap with the segmented ground-truth shapes (not boxes) is above J = 0.5,
J = 0.7, and J = 0.85, for different number of proposals per image. Results on SegVOC12, SBD, and COCO.

Fig. 12. Bounding-Box Proposals: Recall at different Jaccard levels. Percentage of annotated objects for which
there is a bounding box proposal whose overlap with the ground-truth boxes is above J = 0.5, J = 0.7, and J = 0.85,
for different number of proposals per image. Results on SegVOC12, SBD, and COCO.

The complementarity of MCG with respect to other
proposal techniques, and their combination using the
Pareto front optimization, is studied in [44].

Boxes Proposals: Comparison with State of the
Art: Although focused on providing segmented object
proposals, MCG may also be used to provide boxes
proposals, by taking the bounding box of the segmented
proposals and deduplicating them. We add the state
of the art in boxes proposals [26], [27], [25], and [?]
to the comparison. Figure 12 shows the recall values
of the boxes results when compared to the bounding
boxes of the annotated objects, for three different Jaccard
thresholds.
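
Turning segmented proposals into box proposals can be sketched as below, assuming each proposal is a set of (row, column) pixels and boxes use inclusive pixel coordinates. The function names and the exact-match deduplication are assumptions of this example.

```python
def bounding_box(pixels):
    """Tight box (top, left, bottom, right) with inclusive pixel coordinates."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))

def box_proposals(masks):
    """Bounding boxes of the masks, in ranking order, without exact repeats."""
    seen, boxes = set(), []
    for mask in masks:
        box = bounding_box(mask)
        if box not in seen:
            seen.add(box)
            boxes.append(box)
    return boxes

def box_area(box):
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def box_jaccard(a, b):
    """Intersection over union of two inclusive-coordinate boxes."""
    height = min(a[2], b[2]) - max(a[0], b[0]) + 1
    width = min(a[3], b[3]) - max(a[1], b[1]) + 1
    intersection = max(0, height) * max(0, width)
    return intersection / (box_area(a) + box_area(b) - intersection)

l_shape = {(0, 0), (1, 0), (1, 1)}
square = {(0, 0), (0, 1), (1, 0), (1, 1)}   # same bounding box as the L shape
dot = {(5, 5)}
boxes = box_proposals([l_shape, square, dot])
overlap = box_jaccard((0, 0, 1, 1), (0, 0, 1, 3))
```

Different segmented proposals often share the same bounding box, as the L shape and the square do here, which is why the deduplication step matters for boxes.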

While many of the techniques specifically tailored to
boxes proposals are competitive at J = 0.5, their performance
drops significantly at higher Jaccard thresholds.
Despite being tailored to segmented proposals, MCG
clearly outperforms the state of the art when precise
localization of the bounding boxes is required. Again,
SCG is very competitive, especially taking its speed into
account.

MCG and SCG Time per Image: Table 3 shows
the time per image of the main steps of our approach,
from the raw image to the contours, the hierarchical
segmentation, and the proposal generation. All times are
computed using a single core on a Linux machine. Our
full MCG takes about 25 s per image to compute the
multiscale hierarchies, and 17 s to generate and rank
the 5 038 proposals on the validation set of SegVOC12.
Our single-scale SCG produces a segmentation hierarchy
of better quality than gPb-UCM [20] in less than 3 seconds
and with a significantly smaller memory footprint.

Fig. 13. COCO Qualitative Results: Image, ground truth, multi-scale UCM and best MCG proposals among the 500
best ranked. (More qualitative examples in the supplemental material.)



       Contour      Hierarchical    Candidate
       Detection    Segmentation    Generation    Total

MCG    4.6 ± 1.3    20.5 ± 5.6      17.0 ± 9.8    42.2 ± 14.8
SCG    1.0 ± 0.3    2.7 ± 0.7       2.6 ± 0.4     6.2 ± 1.1


TABLE 3

Time in seconds per image of MCG and SCG


Table 4 shows the time-per-image results compared to
the rest of the state of the art in segmented proposals
generation, all run on the same single-core Linux machine.


              Proposal
              Generation

MCG (Our)     42.2 ± 14.8
SCG (Our)     6.2 ± 1.1
GOP [13]      1.0 ± 0.3
GLS [?]       7.9 ± 1.7
SeSe [12]     15.9 ± 5.2
RIGOR [?]     31.6 ± 16.0
CPMC [?]      >120
CI [?]        >120
ShSh [?]      >120


TABLE 4

Time comparison (in seconds per image) for all considered state-of-the-art
techniques that produce segmented proposals. All run
on the same single-core Linux machine.


Practical considerations: One of the key aspects of
object proposals is the size of the pool they generate.
Depending on the application, one may need more precision
and thus a bigger pool, or one might need speed and
thus a small pool in exchange for some loss of quality.
MCG and SCG provide a ranked set of around 5 000 and
2 000 proposals, respectively, and one can take the first
N in case the specific application needs a smaller pool.
From a practical point of view, this means that one does
not need to re-parameterize them for the specific needs
of a certain application.

In contrast, the techniques that do not provide a
ranking of the proposals need to be re-parameterized
to adapt them to a different number of proposals, which
is not desirable in practice.

On top of that, the results show that MCG and SCG
have outstanding generalization power to unseen images
(recall that the results for SBD and COCO have been
obtained using the learnt parameters on SegVOC12),
meaning that MCG and SCG offer the best chance to
obtain competitive results in an unseen database without
the need to retrain.

Figure 13 shows some qualitative results on COCO.

7 Conclusions 

We proposed Multiscale Combinatorial Grouping
(MCG), a unified approach for bottom-up segmentation
and object proposal generation. Our approach produces
state-of-the-art contours, hierarchical regions, and object
proposals. At its core are a fast eigenvector computation
for normalized-cut segmentation and an efficient
algorithm for combinatorial merging of hierarchical
regions. We also present Single-scale Combinatorial
Grouping (SCG), a sped-up version of our technique
that produces competitive results in under five seconds
per image.

We perform an extensive validation on BSDS500,
SegVOC12, SBD, and COCO, showing the quality,
robustness, and scalability of MCG. Recently, an independent
study [45], [46] provided further evidence of the
merit of MCG among the current state of the art in
object proposal generation. Moreover, our object candidates
have already been employed as an integral part of
high-performing recognition systems [47].

In order to promote reproducible research on perceptual
grouping, all the resources of this project (code,
pre-computed results, and evaluation protocols) are
publicly available.
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